BUG-22796 Concat multicolumn tz-aware DataFrame #23036
Conversation
Hello @tonytao2012! Thanks for updating the PR.
Comment last updated on October 09, 2018 at 12:08 UTC
@@ -53,6 +53,21 @@ def test_concat_multiple_tzs(self):
        expected = DataFrame(dict(time=[ts2, ts3]))
        assert_frame_equal(results, expected)

    def test_concat_tz_NaT(self):
Could you also add a test where ts1 is NaT?
Good catch, it results in an object column instead of the correct dtype. I'm not entirely sure how to solve this issue, as it appears to happen when the dataframe is created before the concat even occurs.
ts1 = pd.Timestamp(pd.NaT, tz='UTC')
df1 = pd.DataFrame([[ts1, ts2]])

df1[0]
Out[37]:
0   NaT
Name: 0, dtype: datetime64[ns]
Any ideas?
That behavior is theoretically correct if NaT was replaced by a naive Timestamp. NaT is an edge case where it can exist with a datetime64[ns, tz] dtype.
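As a quick illustration of that edge case (a minimal sketch, not code from this PR; the dtype string is the standard pandas spelling):

```python
import pandas as pd

# NaT stored under an explicitly tz-aware dtype: the missing value
# itself is tz-naive, but the container's dtype stays datetime64[ns, UTC].
s = pd.Series([pd.NaT], dtype="datetime64[ns, UTC]")
print(s.dtype)     # datetime64[ns, UTC]
print(s[0] is pd.NaT)  # True
```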
Don't worry about solving that in this PR. Feel free to open up another issue about that case.
Sounds good, should I still add the new test?
You can add the test with ts1 as NaT and xfail it, but link it to a new issue.
The bug you're describing may be #12499
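A sketch of the suggested pattern, assuming pytest's `pytest.param`/`xfail` API (the `reason` text pointing at GH#23037 is illustrative, taken from the issue linked later in this thread):

```python
import pandas as pd
import pytest

# Parametrize over timestamps; the NaT case is expected to fail until
# the linked issue is resolved, so it carries an xfail marker.
TS_CASES = [
    pd.Timestamp("2015-01-01", tz="UTC"),
    pytest.param(pd.NaT, marks=pytest.mark.xfail(reason="GH#23037")),
]
```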
Codecov Report
@@ Coverage Diff @@
## master #23036 +/- ##
==========================================
- Coverage 92.19% 92.19% -0.01%
==========================================
Files 169 169
Lines 50904 50911 +7
==========================================
+ Hits 46933 46939 +6
- Misses 3971 3972 +1
Continue to review full report at Codecov.
Added the xfail test and linked to issue #23037. Thanks for the feedback.
@@ -53,6 +55,24 @@ def test_concat_multiple_tzs(self):
        expected = DataFrame(dict(time=[ts2, ts3]))
        assert_frame_equal(results, expected)

    @pytest.mark.parametrize('t1', ['2015-01-01',
                             pytest.param('pd.NaT', marks=pytest.mark.xfail(
Why is pd.NaT in quotes? Shouldn't it just be pd.NaT?
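The distinction matters because the string 'pd.NaT' is just text, not the sentinel. A small sketch (not the PR's test code) of what each form would do inside the test:

```python
import pandas as pd

# Timestamp cannot parse the literal string "pd.NaT"; it raises ValueError.
try:
    pd.Timestamp("pd.NaT")
    string_parsed = True
except ValueError:
    string_parsed = False
print(string_parsed)  # False

# The real sentinel passes through as NaT.
print(pd.Timestamp(pd.NaT, tz="UTC") is pd.NaT)  # True
```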
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -883,6 +883,7 @@ Reshaping
- Bug in :func:`pandas.wide_to_long` when a string is passed to the stubnames argument and a column name is a substring of that stubname (:issue:`22468`)
- Bug in :func:`merge` when merging ``datetime64[ns, tz]`` data that contained a DST transition (:issue:`18885`)
- Bug in :func:`merge_asof` when merging on float values within defined tolerance (:issue:`22981`)
- Bug in :func:`pandas.concat` when merging multicolumn DataFrames with tz-aware data (:issue:`22796`)
Can you be more specific here? This isn't an issue with all multi-column DataFrames with tz-aware data. What are the exact conditions needed?
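For reference, a minimal sketch of the triggering condition (mismatched column sets between tz-aware frames, modeled on the session later in this thread; the column names and dates are illustrative):

```python
import pandas as pd

# Both frames share tz-aware column "A"; only `b` has "B", so concat
# must synthesize missing "B" values for the row coming from `a`.
a = pd.DataFrame({"A": pd.to_datetime(["2015-01-01"]).tz_localize("UTC")})
b = pd.DataFrame({
    "A": pd.to_datetime(["2015-01-01", "2015-01-02"]).tz_localize("UTC"),
    "B": pd.to_datetime(["2015-01-01", "2015-01-02"]).tz_localize("UTC"),
})
out = pd.concat([a, b], sort=True)
print(out["B"].isna().sum())  # 1 — the row from `a` is filled with a missing value
```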
pandas/core/internals/concat.py
Outdated
@@ -186,6 +188,11 @@ def get_reindexed_values(self, empty_dtype, upcasted_na):

        if getattr(self.block, 'is_datetimetz', False) or \
                is_datetimetz(empty_dtype):
            if self.block is None:
                missing_arr = np.full(self.shape, fill_value)
                missing_time = DatetimeIndex(missing_arr[0],
Can you walk back up the stack and see exactly where block is None? IIRC this function guarantees that block is not None, so this might be an error higher up.
Block is None when there is a mismatched number of columns between the two DataFrames being concatenated. I assumed this was correct behavior?
When I last changed this, I was really trying NOT to use DTI directly here, as we are trying to isolate things like that to higher-level methods.
See if you can revise; otherwise I will take a look.
Hmm, in that case, I'm not sure what to do. Clearly, block is supposed to be None in cases where the columns are mismatched:
for blkno, placements in libinternals.get_blkno_placements(blknos,
mgr.nblocks,
group=False):
assert placements.is_slice_like
join_unit_indexers = indexers.copy()
shape = list(mgr_shape)
shape[0] = len(placements)
shape = tuple(shape)
if blkno == -1:
unit = JoinUnit(None, shape)
When block is None, we have to create and return some array in get_reindexed_values, but np arrays can't have a tz dtype. I apologize if I'm missing something obvious. Feel free to take over the issue if you'd like, as I'm unsure of how to continue from here. I'll also keep thinking about it.
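To illustrate the constraint being described (a sketch, not the PR's code): NumPy has no tz-aware datetime dtype, so the timezone has to live in a pandas container such as DatetimeIndex.

```python
import numpy as np
import pandas as pd

# A filled NumPy array can hold NaT only as object or naive datetime64;
# there is no NumPy spelling for datetime64[ns, UTC].
missing = np.full((3,), np.datetime64("NaT", "ns"))
print(missing.dtype)  # datetime64[ns]

# Wrapping in DatetimeIndex lets the tz be attached at a higher level.
idx = pd.DatetimeIndex(missing).tz_localize("UTC")
print(idx.dtype)  # datetime64[ns, UTC]
```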
Just to make sure, on this PR I get:

In [1]: import pandas as pd

In [2]: a = pd.DataFrame({"A": pd.to_datetime([1, 2]).tz_localize("UTC")})

In [3]: b = pd.DataFrame({"A": pd.to_datetime([1, 2]).tz_localize("UTC"),
   ...:                   "B": pd.to_datetime([1, 2]).tz_localize("UTC")})

In [4]: pd.concat([a, b], sort=True).B.values
Out[4]:
array(['NaT', '1970-01-01T00:00:00.000000001',
       '1970-01-01T00:00:00.000000002'], dtype='datetime64[ns]')

In [5]: pd.concat([a, b], sort=True)
Out[5]:
                                     A                                    B
0  1970-01-01 00:00:00.000000001+00:00                                  NaT
1  1970-01-01 00:00:00.000000002+00:00  1970-01-01 00:00:00.000000001+00:00
0  1970-01-01 00:00:00.000000001+00:00  1970-01-01 00:00:00.000000002+00:00
1  1970-01-01 00:00:00.000000002+00:00

the xfail you added @jreback is for the fact that
hmm, that does look suspicious but that's not the reason for the xfail.
pushed an update, that was a just-introduced bug.
one more time to fix the lint issue.
thanks @tonytao2012, welcome to have you work on the issue that we are xfailing
git diff upstream/master -u -- "*.py" | flake8 --diff
NumPy arrays don't have the datetimetz dtype, so I just passed the DatetimeIndex through directly.
Side note: there's another small bug (I think) where np.nan or pd.NaT takes on the dtype of the column instead of the row when concatenating, but the column should instead have an object dtype.
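The side note's behavior can be seen with plain numeric columns too (a minimal sketch, not tied to this PR's tz case): the filled values adopt the column's upcast dtype rather than producing an object column.

```python
import pandas as pd

a = pd.DataFrame({"A": [1]})
b = pd.DataFrame({"A": [2], "B": [3]})
out = pd.concat([a, b], sort=True)

# The missing B value for the row from `a` is filled with NaN, and the
# whole column is upcast to float64 rather than becoming object dtype.
print(out["B"].dtype)  # float64
```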